A MATLAB toolbox to fit and forecast growth trajectories using phenomenological growth models: Application to epidemic outbreaks

Background Simple dynamic modeling tools can be useful for generating real-time short-term forecasts with quantified uncertainty of the trajectory of diverse growth processes unfolding in nature and society, including disease outbreaks. An easy-to-use and flexible toolbox for this purpose is lacking. Results In this tutorial-based primer, we introduce and illustrate a user-friendly MATLAB toolbox for fitting and forecasting time-series trajectories using phenomenological dynamic growth models based on ordinary differential equations. This toolbox is accessible to various audiences, including students training in time-series forecasting, dynamic growth modeling, parameter estimation, parameter uncertainty and identifiability, model comparison, performance metrics, and forecast evaluation, as well as researchers and policymakers who need to conduct short-term forecasts in real-time. The models included in the toolbox capture exponential and sub-exponential growth patterns that typically follow a rising pattern followed by a decline phase, a common feature of contagion processes. Models include the 2-parameter generalized-growth model, which has proved useful to characterize and forecast the ascending phase of epidemic outbreaks, and the Gompertz model as well as the 3-parameter generalized logistic-growth model and the Richards model, which have demonstrated competitive performance in forecasting single peak outbreaks. The toolbox provides a tutorial for forecasting time-series trajectories that include the full uncertainty distribution, derived through parametric bootstrapping, which is needed to construct prediction intervals and evaluate their accuracy. Functions are available to assess forecasting performance across different models, estimation methods, error structures in the data, and forecasting horizons. The toolbox also includes functions to quantify forecasting performance using metrics that evaluate point and distributional forecasts, including the weighted interval score. Conclusions We have developed the first comprehensive toolbox to characterize and forecast time-series data using simple phenomenological growth models. As a contagion process takes off, the tools presented in this tutorial can facilitate policymaking to guide the implementation of control strategies and assess the impact of interventions. The toolbox functionality is demonstrated through various examples, including a tutorial video, and is illustrated using weekly data on the monkeypox epidemic in the USA.


Abstract 24
Background 25 Simple dynamic modeling tools can be useful for generating real-time short-term forecasts with 26 quantified uncertainty of the trajectory of diverse growth processes unfolding in nature and society, 27 including disease outbreaks. An easy-to-use and flexible toolbox for this purpose is lacking. 28

Results 29
In this tutorial-based primer, we introduce and illustrate a user-friendly MATLAB toolbox for 30 fitting and forecasting time-series trajectories using phenomenological dynamic growth models 31 based on ordinary differential equations. This toolbox is accessible to various audiences, including 32 students training in time-series forecasting, dynamic growth modeling, parameter estimation, 33 parameter uncertainty and identifiability, model comparison, performance metrics, and forecast 34 evaluation, as well as researchers and policymakers who need to conduct short-term forecasts in 35 real-time. The models included in the toolbox capture exponential and sub-exponential growth 36 patterns that typically follow a rising pattern followed by a decline phase, a common feature of 37 contagion processes. Models include the 2-parameter generalized-growth model, which has proved 38 useful to characterize and forecast the ascending phase of epidemic outbreaks, and the Gompertz 39 model as well as the 3-parameter generalized logistic-growth model and the Richards model,40 which have demonstrated competitive performance in forecasting single peak outbreaks. 41 The toolbox provides a tutorial for forecasting time-series trajectories that include the full 42 uncertainty distribution, derived through parametric bootstrapping, which is needed to construct 43 prediction intervals and evaluate their accuracy. Functions are available to assess forecasting 44 performance across different models, estimation methods, error structures in the data, and 45 forecasting horizons. The toolbox also includes functions to quantify forecasting performance 46 using metrics that evaluate point and distributional forecasts, including the weighted interval score. 47

Conclusions 48
We have developed the first comprehensive toolbox to characterize and forecast time-series data 49 using simple phenomenological growth models. As a contagion process takes off, the tools 50 are key for decision-making in all aspects of life including predicting the weather, commercial 63 product demand, the number of cases of an emerging infectious disease, and the growth or decline 64 of the economy. While simple statistical models such as ARIMA models have become popular for 65 forecasting time series [1][2][3][4][5], dynamical models based on rates of change equations (i.e., 66 differential equations) are less frequently applied by non-specialists in specific scientific fields 67 although forecasts based on these types of models can provide more information about the process 68 of interest by characterizing specific mechanisms and parameters involved in their dynamics [6-69 9]. For instance, simple phenomenological growth models which are discussed in this tutorial can 70 help characterize growth rates, scaling of growth, and turning points as well as predict the size of 71 a growing population (i.e., carrying capacity) or an epidemic outbreak at different time horizons 72 [10][11][12][13][14][15][16][17]. Hence, there is a need for an easy-to-use and flexible toolbox to generate short-term 73 forecasts from simple phenomenological growth models with quantified uncertainty of the 74 trajectory of diverse growth processes observed in nature and society, such as infectious disease 75 outbreaks [6]. 76 77 This tutorial paper introduces a user-friendly MATLAB toolbox to fit and forecast time-series 78 trajectories of infectious diseases using phenomenological dynamic growth models based on 79 ordinary differential equations that will find broad applications in the natural and social sciences. 80 This toolbox is written for various audiences, including students training in time-series forecasting, 81 dynamic growth modeling, parameter estimation, parameter uncertainty and identifiability, model 82 Table 1 lists the names of user functions associated with the toolbox along with a brief description  128   of their role. The internal functions associated with the toolbox are given in supplementary table  129 ( to 0 in the options_fit.m or options_forecast.m files. Least squares estimation is 143 achieved by searching for the set of parameters � that minimizes the sum of squared differences 144 between the observed data = 1 , 2 … . . and the best fit of the model (model mean) which 145 corresponds to ( , Θ). That is, Θ is estimated by Θ � = arg min ∑ ( � , Θ� − ) 2 =1 . In the 146 following section, we will describe different phenomenological growth models included in this 147 toolbox for the expected epidemic trajectory curve ( , Θ). 148 This parameter estimation method weights each data point equally and does not require a specific 150 distributional assumption for , except for the first moment [ ] = ( ; ). That is, the mean 151 of the observed data at time t is equivalent to the expected count denoted by ( , Θ) at time t [24]. 152 This method yields asymptotically unbiased point estimates regardless of any misspecification of 153 the variance-covariance error structure. Hence, the estimated model mean ( , Θ � ) yields the best 154 fit to the observed data in terms of the squared L2 norm. We can use the fmincon function in 155 MATLAB to set the optimization problem. In addition to the underlying normal distribution 156 assumption linked to nonlinear least squares fitting (method1=0), the uncertainty of the 157 parameters can be quantified using a Poisson or negative binomial error structure using parameter 158 <dist1> as described in the next section. Finally, we also employ MATLAB's MultiStart feature 159 to specify the number of random initial guesses of the model parameters using the parameter 160 The more general form of variance is 2 = + (i.e., method1=5 in options_fit.m 208 or options_forecast.m) with any −∞ < < ∞. Then the full log-likelihood (1.1) can be 209 expressed as follows: 210

Parametric bootstrapping 224
To quantify parameter uncertainty, we follow a parametric bootstrapping approach which allows 225 the computation of standard errors and related statistics without closed-form formulas [26]. We 226 generate bootstrap samples from the best-fit model ( , � ), with an assumed error structure, 227 specified using parameter <dist1> in the options_fit.m or options_forecast.m 228 files, to quantify the uncertainty of the parameter estimates and construct confidence intervals. 229 Typically, the error structure in the data is modeled using a probability model such as the Poisson 230 or negative binomial distribution. Using nonlinear least squares (method1=0), in addition to a 231 normally distributed error structure (dist1=0), we can also assume a Poisson (dist1=1) or a 232 negative binomial distribution (dist1=2), whereby the variance-to-mean ratio is empirically 233 estimated from the time series. To estimate this constant ratio, we group a fixed number of 234 observations (e.g., 7 observations for daily data into a bin across time), calculate the mean and 235 variance for each bin, and then estimate a constant variance-to-mean ratio by calculating the 236 average of the variance-to-mean ratios over these bins. Using maximum likelihood estimation, we 237 can estimate parameter uncertainty for Poisson and negative binomial error structures in the data 238

Model-based forecasts with quantified uncertainty 254
Based on the best-fit model � , Θ � �, we can make days ℎ ahead forecasting using ( + ℎ, Θ � ). 255 The uncertainty of the forecasted value can be obtained using the previously described parametric 256 bootstrap method. Let 257 Models commonly used to study the growth pattern of infectious disease outbreaks often assume 277 exponential growth in the absence of control interventions (compartmental models, for example); 278 however, growth patterns are likely slower than exponential for some diseases depending on the 279 transmission mode and population structure. For example, Ebola spreads via close contact so sub-280 exponential growth patterns would be expected in a constrained population contact structure [27]. 281 The generalized growth model (GGM) [21] includes a "deceleration of growth" parameter, p 282 (range: [0, 1]), that relaxes the assumption of exponential growth. A value of p=0 represents 283 constant (linear) growth, while a value of p=1 indicates exponential growth. If 0<p<1, the growth 284 pattern is characterized as sub-exponential or polynomial. 285 The GGM is as follows: 286 where C(t) describes the cumulative number of cases at time t, ( ) is the incidence curve, r is the 288 growth rate parameter (r>0), and p, again, is the deceleration of growth parameter [21]. The Generalized Logistic growth model (GLM) has three parameters and is given by: The growth scaling parameter, , is also used in the GLM to model a range of early epidemic 298 growth profiles ranging from constant incidence ( = 0 ), polynomial (0 < < 1), and 299 exponential growth dynamics ( = 1). When = 1, this model reduces to the logistic growth 300 model (flag1=3). The remaining model parameters are as follows: is the growth rate, and 0 301 is the final cumulative epidemic size. For this model, we estimate Θ = ( , , 0 ) where ( , Θ) = 302 ′ ( ), and the initial number of cases (0) is fixed according to the first observation in the data. 303 The GLM model has been employed to generate short-term forecasts of Zika, Ebola, and COVID-304 19 epidemics [10,14,15,28] where is the growth rate, is a scaling parameter and 0 is the final epidemic size. The Richards 316 model has been employed to generate short-term forecasts of SARS, Zika, Ebola, and COVID-19 317 epidemics [10,11,14,15,28]

Quality of model fit
To assess the quality of model fit, we can compare the AICc (Akaike Information Criterion) values 344 of the best-fit models. The is given by [33,34]: 345 where is the number of model parameters and is the number of data points. Specifically for 347 normal distribution, the is 348 . 379 The coverage of the 95% prediction interval (PI) corresponds to the fraction of data points that 380 fall within the 95% PI, calculated as 381  For the purposes of this toolbox, the time-series data will be stored in the 'input' folder and needs 444 to be a text file with the extension *.txt. The first column should correspond to the time index: 445 1,2,3, ..., and the second corresponds to the temporal incidence data. If the time series file contains 446 cumulative incidence count data, the name of the time series data file must start with "cumulative". 447

448
To illustrate the methodology presented in this tutorial paper, we employ the weekly incidence  (1,1,32) 507 508 This function will store the following .csv files in the output folder: 509 1) The model fit to the data: 510 Fit-flag1-1-i-1-fixI0-1-method-0-dist-0-calibrationperiod-511 32-horizon-0-monkeypox-cases.csv 512 2) Model parameter estimates: 513 parameters-rollingwindow-flag1-1-fixI0-1-method-0-dist-0-514 tstart-1-tend-1-calibrationperiod-32-horizon-0-monkeypox-515 cases.csv 516 3) Calibration performance metrics: 517 parameters-rollingwindow-flag1-1-fixI0-1-method-0-dist-0-518 tstart-1-tend-1-calibrationperiod-32-horizon-0-monkeypox-519 cases.csv 520 521 For this example, the resulting calibration performance metrics indicate that the 95% prediction 522 intervals include all the data points comprising the epidemic curve of monkeypox in the USA (i.e., 523 coverage of the 95% PI is 100%) (see Table 2). This function also plots the temporal sequence of 524 parameter estimates and their uncertainty obtained from the rolling-window analysis whenever the 525 value of the parameter <tend1> is greater than parameter <tstart1>. For instance, after 526 running the function Run_Fit_GrowthModels(1,3,30)to generate a rolling window 527 analysis of model fits with window size 30 and start times at 1,2, and 3, we can run the function 528 plotFit_GrowthModels (1,3,30  We can also assess the fit of the Richards model to the monkeypox incidence curve (flag1=4 in 544 options_fit.m file) and compare the quality of the model fit to that obtained using the 545 generalized logistic growth model using the performance metrics. Figure  The calibration performance metrics of the generalized logistic growth model and the Richards 558 model (see Table 2) indicate that the Richards model yields a better fit to the data in terms of the 559 MAE, MSE, and WIS while both models achieved a 100% coverage of the 95% prediction interval. The forecasting performance metrics of the generalized logistic growth model and the Richards 652 model based on the first 10 weeks of the epidemic curve (shown in In this tutorial-based primer, we have introduced the first toolbox to fit and forecast time-series 667 trajectories using phenomenological dynamic growth models with applications in natural and 668 social sciences. In particular, the models included in the toolbox have been frequently applied to 669 characterize and forecast epidemic trajectories in near real-time [10,11,14,15,18,21,28]